Short samples in authorship attribution: a new approach

Author

  • Maciej Eder
Abstract

The question of minimal sample size is one of the most important issues in stylometry and nontraditional authorship attribution. In the last decade or so, several studies addressing different aspects of scalability in stylometry have been published (Zhao and Zobel, 2005; Hirst and Feiguina, 2007; Stamatatos, 2008; Koppel et al., 2009; Mikros, 2009; Luyckx and Daelemans, 2011), but the question has not been answered comprehensively. In a recent study, Eder proposed a systematic approach to the problem in a series of experiments, claiming that a sample should contain at least 5,000 running words to be attributable (Eder, 2015). The above studies (and many others as well) tacitly assume that there exists a certain amount of linguistic data that allows for reliable authorial recognition, and that the real problem at stake is to determine that very value. However, one can assume that the authorial fingerprint is not distributed evenly across a collection of texts. On the contrary, many experiments seem to suggest that the authorial voice is sometimes overshadowed by other signals, such as genre, gender, chronology, or translation. Some authors, say Chandler, should be easily attributable, while others, say Virginia Woolf, will probably have their fingerprint somewhat hidden. Moreover, authorship attribution is ultimately a matter of context: telling apart Hemingway and Dickens will always be easier than distinguishing the Brontë sisters. On theoretical grounds, then, the minimal sample size cannot be determined once and for all for the entire corpus, but may differ from text to text within the corpus.


Similar resources

A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages

With the popularity of Internet technologies and applications, inappropriate or illegal online messages have become a problem for the society. The goal of authorship attribution for anonymous online messages is to identify the authorship from a group of potential suspects for investigation identification. Most previous contributions focused on extracting various writing-style features and emplo...


Authorship Attribution of Micro-Messages

Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new...


Source Code Authorship Attribution Using Long Short-Term Memory Based Networks

Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST)...


Modality Specific Meta Features for Authorship Attribution in Web Forum Posts

This paper presents a new method for Authorship Attribution (AA) on online forum posts. The idea behind the method is to generate meta features that capture modality specific similarity relations among texts from different authors. Each modality refers to a particular linguistic dimension (syntactic, lexical, stylistic). To evaluate this approach we measure prediction accuracy on data from an o...


Supporting the Cybercrime Investigation Process: Effective Discrimination of Source Code Authors Based on Byte-level Information

Source code authorship analysis is the particular field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. This is usually based on other undisputed program samples from the same author. There are several cases where the application of such a method could be of a major benefit, such as tracing the source of code left in the...




Publication date: 2017